DL II Math Final Project Report¶

Title: Creating 3D images/clips from 2D images¶

Group Members:¶

Dusan Birtasevic
Kavian Mashayekhi
Narjes Amusoltani
Tina Khazaee


1- Abstract¶

This computer vision project introduces a method for generating facial 3D images/clips from 2D images, with a focus on enhancing facial details and creating captivating 3D animations.
The pipeline combines YOLOv8 for face detection, an Efficient Sub-Pixel Convolutional Neural Network (CNN) for image super-resolution, Real-ESRGAN for realistic enhancement, and the First Order Motion Model for animation synthesis.

Initially, YOLOv8 accurately localizes and extracts facial regions from input 2D images. These faces are then cropped and passed to the Efficient Sub-Pixel CNN, a powerful network that reconstructs high-resolution facial images, significantly improving facial quality and detail.

Input Image:

Input Image

Prediction (yolov8_face_detection Notebook):

Input Image

Bounding Box Crop (yolov8_face_detection Notebook):

Input Image

To further enhance the upscaled images, Real-ESRGAN, a Generative Adversarial Network (GAN) specialized for super-resolution tasks, is used to generate visually realistic and finely detailed facial representations.

Lastly, the First Order Motion Model breathes life into the enhanced facial images, allowing for the creation of dynamic 3D animations. The model transfers facial movements from source videos to the improved 3D facial representations, resulting in realistic and captivating visualizations.

Enhanced Image (ESRGAN_Image_Enhancement Notebook):

Input Image

Animated 2D Picture (First Order Motion Notebook):


Enhance Video (ESRGAN_Video_Enhancement Notebook):


Our primary motivation was to enhance facial details in 2D images and produce compelling 3D animations. The potential applications of this technology range from aiding suspect identification for law enforcement to capturing evidence of thieves caught on security cameras.

Through comprehensive experimentation and evaluation, our approach demonstrates substantial improvements in facial details, animation realism, and overall visual appeal. This project contributes to the field of computer vision by opening up possibilities for advancements in facial enhancement and animation synthesis, with promising applications in security, entertainment, and various other domains.


2- Introduction¶

Our project focuses on the fascinating task of transforming 2D images into realistic and dynamic 3D representations. The process of converting flat images into immersive 3D scenes presents a challenging problem due to the absence of depth information in 2D format. To address this issue, we aim to develop an efficient and user-friendly solution that automates the generation of 3D images and clips, making it accessible to a wide range of users.

2-1- Problem Statement¶

Creating 3D content traditionally involves labor-intensive manual processes and specialized software. This limits its widespread adoption and inhibits its potential impact across various industries. Our project seeks to overcome the barriers associated with 3D content creation by developing an automated approach that requires minimal user intervention, thus democratizing the accessibility of 3D visuals.

2-2- Importance¶

The ability to produce 3D images and clips from 2D sources holds great significance for industries like entertainment, education, design, and marketing. Enabling a broader audience, including non-experts, to generate 3D content can lead to the proliferation of more engaging and interactive media. Moreover, by reducing the time and skill requirements, our solution can empower creative professionals and businesses to enhance their visual communication and storytelling capabilities.

2-3- Overview of Result¶

Through our research and development, we have devised an innovative algorithm that utilizes advanced computer vision and deep learning techniques. Our algorithm achieves impressive results by accurately inferring depth information from 2D images and translating it into compelling 3D renditions. The generated 3D images and clips exhibit a convincing level of realism and immersion, mirroring the characteristics of manually crafted 3D content.


3- Related Works and sources¶

To build this project, we adapted several existing projects and tuned them to achieve the best possible result.
The idea behind this project comes from MyHeritage Deep Nostalgia, which this project aims to mimic.

Secondly, we used the First Order Motion Model for Image Animation as our core model for creating 3D images/clips from 2D images.

In addition, we used and fine-tuned the work from the paper Real-ESRGAN: Training Real-World Blind Super-Resolution with Pure Synthetic Data and its linked notebook.

We also tried Image Super-Resolution using an Efficient Sub-Pixel CNN to enhance the detail of our input images before passing them to the 2D to 3D model.


4- Data¶

4-1- YOLOv8¶

For YOLO, a model with pre-trained weights was used. Model location: https://github.com/akanametov/yolov8-face

4-2- Image Super-Resolution using an Efficient Sub-Pixel CNN Dataset¶

For the image enhancement part, BSDS500 (Berkeley Segmentation Dataset 500) was used. This dataset is designed for evaluating natural edge detection, which includes not only object contours but also object interior boundaries and background boundaries. It contains 500 natural images with carefully annotated boundaries collected from multiple users.
Super-resolution training needs only the images themselves, not the boundary annotations, so this source provided all the data we required.

4-3- Image Super-Resolution using an Efficient Sub-Pixel CNN Fine Tuning Dataset¶

To fine-tune the model trained in the "Image Enhancement" notebook, we chose the following dataset from Kaggle:
https://www.kaggle.com/datasets/sanidhyak/human-face-emotions

This dataset has over 250 Sad, Angry, and Happy face images. All of these categories were merged and then split into train, test, and validation sets for fine-tuning the pre-trained model.
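
The merge-and-split step can be sketched as follows. This is a minimal illustration only: the 80/10/10 fractions, the seed, and the file names are assumptions, not the values used in the project.

```python
import random


def split_dataset(items, train_frac=0.8, valid_frac=0.1, seed=42):
    """Shuffle items and split them into train/validation/test lists.

    The fractions and the seed are illustrative assumptions.
    """
    if train_frac <= 0 or valid_frac < 0 or train_frac + valid_frac >= 1:
        raise ValueError("fractions must leave room for a test split")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test


# Hypothetical file names standing in for the merged Sad/Angry/Happy images.
files = [f"face_{i:03d}.jpg" for i in range(250)]
train_files, valid_files, test_files = split_dataset(files)
print(len(train_files), len(valid_files), len(test_files))
```

Shuffling before slicing keeps the three emotion categories mixed across the splits instead of leaving one category entirely in the test set.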


5- Methods¶

5-1- YOLOv8¶

The provided code implements face detection using YOLOv8. The pre-trained YOLOv8 face detection model is loaded, and an image is uploaded for processing. The model predicts faces' bounding box coordinates, which are then visualized on the original image. Additionally, the code crops and saves individual face images based on the detected bounding boxes. This methodology enables efficient face detection and provides the ability to extract and save individual face images for further analysis or processing.
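
The cropping step described above can be sketched without any detector dependency. The helper below is hypothetical (its name and the `margin` option are not from the project notebook) and assumes each detection is a float box `(x1, y1, x2, y2)` in pixel coordinates. It turns the box into integer bounds clamped to the image, which can then be passed to a cropping call such as PIL's `Image.crop`.

```python
import math


def box_to_crop(box, image_width, image_height, margin=0.0):
    """Convert a float (x1, y1, x2, y2) box into integer crop bounds.

    margin is a hypothetical option: the fraction of the box size that is
    added on every side before clamping to the image borders.
    """
    x1, y1, x2, y2 = box
    pad_x = (x2 - x1) * margin
    pad_y = (y2 - y1) * margin
    left = max(0, math.floor(x1 - pad_x))
    top = max(0, math.floor(y1 - pad_y))
    right = min(image_width, math.ceil(x2 + pad_x))
    bottom = min(image_height, math.ceil(y2 + pad_y))
    if right <= left or bottom <= top:
        raise ValueError(f"box {box} lies outside the image")
    return left, top, right, bottom


# Two made-up detections on a 640 x 480 image; the second one spills over
# the left and bottom borders and is clamped.
boxes = [(10.4, 20.7, 110.2, 140.9), (-5.0, 300.0, 60.0, 500.0)]
crops = [box_to_crop(b, 640, 480) for b in boxes]
print(crops)
```

Rounding outward (floor for the top-left corner, ceil for the bottom-right) keeps the whole detected face inside the crop.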

5-2- Image Super-Resolution using an Efficient Sub-Pixel CNN¶

Before feeding the image into the model for 3D image generation, we enhanced the image detail and resolution. The reason for this step is that the generator model's output quality depends directly on the quality of the input image.

We therefore perform this step after detection with YOLO and before feeding the image to the main model, in order to get a better output from it.

5-3- Real-ESRGAN¶

image.png

Real-ESRGAN is a model designed for upscaling and enhancing real-world images. It builds on the principles of Generative Adversarial Networks (GANs), extending the earlier ESRGAN model, and shares the goal of efficient upsampling with Efficient Sub-pixel Convolutional Neural Networks (ESPCN). Here's how it works:

  1. Architecture: Real-ESRGAN combines a generator network and a discriminator network. The generator network is tasked with upsampling an input image, while the discriminator network tries to distinguish the upscaled images from real high-resolution ones. This adversarial process pushes the generator to produce increasingly convincing high-resolution images.

  2. ESPCN (Efficient Sub-Pixel Convolutional Neural Network): ESPCN focuses on increasing spatial resolution efficiently. By using sub-pixel convolution, the network rearranges elements of the channel dimension into the spatial high-resolution domain. It's a computationally efficient way of achieving higher resolution without resorting to heavy transposed convolutions. Separately, when building training pairs, Real-ESRGAN uses a 2D sinc filter to synthesize ringing and overshoot artifacts: $$ k(i, j) = \frac{\omega_c}{2\pi\sqrt{i^2 + j^2}} J_1\left(\omega_c \sqrt{i^2 + j^2}\right) $$ where $(i, j)$ is the kernel coordinate, $\omega_c$ is the cutoff frequency, and $J_1$ is the first-order Bessel function of the first kind.

  3. Pre-trained Models and Fine-tuning: Real-ESRGAN usually leverages pre-trained models and fine-tunes them on specific tasks or domains. This transfer learning approach significantly reduces the training time and resources needed.

  4. Loss Functions: It uses several loss functions to train the generator, including:

    • Content Loss: Measures the difference between the generated image and the target high-resolution image, often using a perceptual loss based on deep feature maps.
    • Adversarial Loss: Used to make the generated images indistinguishable from real high-resolution images. It's a fundamental part of the GAN framework.
    • Pixel (L1) Loss: Penalizes the per-pixel difference between the generated image and the target, which keeps the output faithful to the ground truth while the other losses sharpen textures.
  5. Real-world Degradation Modeling: One innovation in Real-ESRGAN is its focus on modeling real-world image degradation instead of simple synthetic degradations (e.g., a single downscaling with a simple filter). As the paper title indicates, the training pairs are still purely synthetic: high-quality images are degraded by a high-order process that repeatedly applies blur, resizing, noise, and JPEG compression, allowing the network to learn the complex and varied degradations present in real scenarios. image-2.png

  6. Results: The final result is an enhanced and upscaled image that not only has higher pixel resolution but also exhibits improvements in sharpness, detail, and overall visual quality.

Real-ESRGAN represents a combination of several advanced techniques in machine learning and computer vision. It offers significant improvements over earlier super-resolution models, especially when dealing with real-world, non-ideal, and varied images. Its adaptability and performance make it valuable in various applications, ranging from restoring old films to enhancing satellite imagery.
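
The ringing filter mentioned in item 2 can be illustrated with a small, dependency-free sketch. It assumes the kernel has the form k(i, j) = ωc / (2π·r) · J1(ωc·r) with r = sqrt(i² + j²), which is our reading of the Real-ESRGAN paper. The function names, the kernel size, and the cutoff are illustrative choices, and J1 is approximated numerically from Bessel's integral rather than taken from a library.

```python
import math


def bessel_j1(x, steps=200):
    """First-order Bessel function of the first kind, J1(x).

    Uses Bessel's integral J1(x) = (1/pi) * int_0^pi cos(t - x*sin(t)) dt,
    evaluated with the midpoint rule.
    """
    h = math.pi / steps
    total = sum(math.cos(t - x * math.sin(t))
                for t in ((k + 0.5) * h for k in range(steps)))
    return total * h / math.pi


def sinc_kernel(cutoff, size):
    """Return a size x size circular low-pass (2D sinc) kernel that sums to 1."""
    if size % 2 == 0:
        raise ValueError("size must be odd so the kernel has a center")
    half = size // 2
    kernel = []
    for i in range(-half, half + 1):
        row = []
        for j in range(-half, half + 1):
            r = math.hypot(i, j)
            if r == 0:
                # Limit of the formula as r -> 0, because J1(x) ~ x / 2.
                row.append(cutoff ** 2 / (4 * math.pi))
            else:
                row.append(cutoff / (2 * math.pi * r) * bessel_j1(cutoff * r))
        kernel.append(row)
    total = sum(map(sum, kernel))
    return [[value / total for value in row] for row in kernel]


# Illustrative parameters: cutoff frequency pi/3 and a 7 x 7 kernel.
kernel = sinc_kernel(cutoff=math.pi / 3, size=7)
for row in kernel:
    print(" ".join(f"{value:+.4f}" for value in row))
```

The negative lobes toward the kernel corners are what produce ringing and overshoot when the kernel is convolved with sharp edges.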

5-4- First Order Motion Model¶

image.png

The First Order Motion Model is a deep learning approach designed to animate a given image using the motion extracted from a driving video. It's widely used for applications like facial animation, video editing, and avatar creation. Here's how the First Order Motion Model works:

  1. Input: The model takes two primary inputs:

    • A source image, which you want to animate.
    • A driving video, from which the model will extract motion information.
  2. Motion Extraction: The driving video is fed into the model to extract motion information. This is usually done by employing a set of keypoints that represent the essential parts of the image (such as facial landmarks if the image is a face).

  3. Keypoint Detector: A specific module called a keypoint detector is used to find keypoints in both the source image and the driving video. This helps in understanding the structure and motion between the two.

  4. Keypoint Descriptor: Besides the keypoints, a local region around each keypoint, referred to as a descriptor, is extracted. This helps in understanding the texture and appearance around the keypoints.

  5. Motion Representation: The motion between the source image and the driving video is represented as a sparse set of keypoints and dense local changes in the appearance around the keypoints. The model calculates the difference in keypoints and descriptors between the source image and each frame of the driving video. Formally, the transformation of the $k$-th region from the reference frame to the image is computed as: $$ A^k_{X \leftarrow R} \in \mathbb{R}^{2 \times 3} $$

image.png

  6. Generation of the Animation: Using the calculated motion information, the model animates the source image. It employs a generator network that takes the source image, the extracted keypoints, and the motion representation to generate the animated sequence.

image-2.png

  7. Loss Functions: The training process uses several loss functions:

    • Reconstruction Loss: Measures how well the model reconstructs the driving video when it's also used as the source image.
    • Keypoint Consistency Loss: Ensures that keypoints are consistent across different frames.
    • Adversarial Loss: Helps in making the generated frames indistinguishable from real video frames.
  8. Result: The output is a sequence of images or a video where the source image is animated according to the motion extracted from the driving video.

What's unique about the First Order Motion Model is that it doesn't require paired training data, meaning you don't need examples of the source image and the corresponding animated sequence during training. This makes the model highly flexible and applicable to a wide variety of images and motions.
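
To make the 2 × 3 transformation in the Motion Representation step concrete, the sketch below applies such a matrix to a few points. The matrix values and keypoint coordinates are made up for illustration, and the helper name is hypothetical. In the model these matrices are predicted per region, not hand-written.

```python
def apply_affine(matrix, point):
    """Map a 2D point with a 2 x 3 affine matrix [[a, b, tx], [c, d, ty]]."""
    (a, b, tx), (c, d, ty) = matrix
    x, y = point
    return (a * x + b * y + tx, c * x + d * y + ty)


# Hypothetical transform: scale by 2, then shift by (10, -5).
A = [[2.0, 0.0, 10.0],
     [0.0, 2.0, -5.0]]
reference_keypoints = [(0.0, 0.0), (1.0, 1.0), (-0.5, 0.25)]
moved = [apply_affine(A, p) for p in reference_keypoints]
print(moved)
```

The left 2 × 2 block of the matrix encodes the local rotation, scale, and shear around a keypoint, and the last column encodes its translation.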


6- Experiments¶

6-1- YOLO¶

image.png

In this experiment, we focused on implementing YOLOv8 for face detection as part of our 2D to 3D model pipeline. We obtained the YOLOv8 face detection model from a public GitHub repository (https://github.com/akanametov/yolov8-face) and used it to perform face detection on uploaded images. The model was loaded and used to predict bounding box coordinates for detected faces, which were then visualized on the original images. Additionally, we extracted and saved individual face images based on the detected bounding boxes.

This allowed us to efficiently detect faces in input images and obtain tightly cropped face regions for further processing. YOLOv8's fast inference and strong detection accuracy made it a valuable component of our pipeline: by isolating the face, it ensures that the later enhancement and 2D to 3D generation steps operate only on the relevant region. The results of this experiment were promising and contributed to the overall success of our 2D to 3D model.

image-2.png

6-2- Image Super-Resolution using an Efficient Sub-Pixel CNN¶

As discussed above, in this step we implemented a CNN model to increase the resolution and image quality of our input image before feeding it to the model for 3D generation.

The model was trained following a notebook for enhancing picture detail, which can be found here: Image Super-Resolution using an Efficient Sub-Pixel CNN.

We used this Efficient Sub-Pixel CNN to increase the detail of the 2D input images of our 2D to 3D model.

Here is the link to the complete notebook: LINK

Below, we will discuss the main part of the implementation.

6-2-1- Data Pre-processing¶

One of the most important parts of this training was preparing the dataset so that we have low-resolution images on the one hand and the original high-quality images on the other. This makes it possible to train the model, and it also gives us a reference against which to evaluate its performance.

For pre-processing, we first changed the color space from RGB to YUV.

For the input data (low-resolution images), we crop the image, retrieve the Y channel (luminance), and resize it with the area method (use BICUBIC if you use PIL). We only consider the luminance channel in the YUV color space because humans are more sensitive to luminance changes.

For the target data (high-resolution images), we just crop the image and retrieve the Y channel.
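
As a rough, dependency-free illustration of these two operations, the sketch below computes the luminance with the standard weights Y = 0.299 R + 0.587 G + 0.114 B and shrinks it by averaging blocks, which is what area resizing amounts to for an exact integer factor. The function names and the tiny test image are hypothetical. The notebook itself uses TensorFlow ops for this.

```python
def luminance(rgb_image):
    """Return the Y channel of an RGB image given as rows of (r, g, b) tuples."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in rgb_image]


def area_downsample(channel, factor):
    """Shrink a 2D channel by averaging factor x factor blocks."""
    height, width = len(channel), len(channel[0])
    if height % factor or width % factor:
        raise ValueError("image size must be a multiple of the factor")
    out = []
    for top in range(0, height, factor):
        row = []
        for left in range(0, width, factor):
            block = [channel[y][x]
                     for y in range(top, top + factor)
                     for x in range(left, left + factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


# A tiny 2 x 4 RGB image with values in [0, 1]: a black/white checker on the
# left and solid red on the right.
image = [
    [(1.0, 1.0, 1.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (1.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
]
y = luminance(image)
low_res = area_downsample(y, factor=2)
print(low_res)
```

The checker pattern averages to mid-gray, which shows how downscaling removes exactly the fine detail the super-resolution model is later asked to restore.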

In [ ]:
import tensorflow as tf


# Use TF ops to process the images.
def process_input(image, input_size, upscale_factor):
    # Low-resolution input: keep only the luminance (Y) channel and shrink it.
    # upscale_factor is not used in this function; the output size is set by
    # input_size.
    image = tf.image.rgb_to_yuv(image)
    last_dimension_axis = len(image.shape) - 1
    y, u, v = tf.split(image, 3, axis=last_dimension_axis)
    return tf.image.resize(y, [input_size, input_size], method="area")


def process_target(image):
    # High-resolution target: the luminance (Y) channel at full size.
    image = tf.image.rgb_to_yuv(image)
    last_dimension_axis = len(image.shape) - 1
    y, u, v = tf.split(image, 3, axis=last_dimension_axis)
    return y


# Each dataset element becomes a (low-resolution input, high-resolution target) pair.
train_ds = train_ds.map(
    lambda x: (process_input(x, input_size, upscale_factor), process_target(x))
)
train_ds = train_ds.prefetch(buffer_size=32)

valid_ds = valid_ds.map(
    lambda x: (process_input(x, input_size, upscale_factor), process_target(x))
)
valid_ds = valid_ds.prefetch(buffer_size=32)

The pre-processing above produces the following low-resolution images, shown first, followed by the original ones.

In [ ]:
for batch in train_ds.take(1):
    for img in batch[0]:
        display(array_to_img(img))
    for img in batch[1]:
        display(array_to_img(img))

6-2-2- CNN Model¶

Then we defined a model as below:

In [ ]:
def get_model(upscale_factor=3, channels=1):
    conv_args = {
        "activation": "relu",
        "kernel_initializer": "Orthogonal",
        "padding": "same",
    }
    inputs = keras.Input(shape=(None, None, channels))
    x = layers.Conv2D(64, 5, **conv_args)(inputs)
    x = layers.Conv2D(64, 3, **conv_args)(x)
    x = layers.Conv2D(32, 3, **conv_args)(x)
    x = layers.Conv2D(channels * (upscale_factor ** 2), 3, **conv_args)(x)
    outputs = tf.nn.depth_to_space(x, upscale_factor)

    return keras.Model(inputs, outputs)
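
The final `tf.nn.depth_to_space` call is the sub-pixel step: it moves groups of channels into spatial positions. The pure-Python sketch below mirrors that rearrangement for a single image stored as nested lists in height, width, channel order. It is an illustration of the indexing only, not the TensorFlow implementation.

```python
def depth_to_space(feature_map, block_size):
    """Rearrange an H x W x (C * r * r) nested list into (H * r) x (W * r) x C."""
    r = block_size
    height, width = len(feature_map), len(feature_map[0])
    depth = len(feature_map[0][0])
    if depth % (r * r):
        raise ValueError("channel count must be divisible by block_size ** 2")
    out_channels = depth // (r * r)
    out = [[None] * (width * r) for _ in range(height * r)]
    for h in range(height):
        for w in range(width):
            for i in range(r):        # row offset inside the r x r block
                for j in range(r):    # column offset inside the block
                    start = (i * r + j) * out_channels
                    out[h * r + i][w * r + j] = (
                        feature_map[h][w][start:start + out_channels]
                    )
    return out


# One pixel with 4 channels becomes a 2 x 2 patch with 1 channel.
patch = depth_to_space([[[1, 2, 3, 4]]], block_size=2)
print(patch)  # [[[1], [2]], [[3], [4]]]
```

With `upscale_factor=3` and one output channel, the last convolution produces 9 channels per low-resolution pixel, and this rearrangement turns them into a 3 × 3 patch of the high-resolution output.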

The model is then instantiated together with the training callbacks, loss, and optimizer; its summary is as follows:

In [ ]:
early_stopping_callback = keras.callbacks.EarlyStopping(monitor="loss", patience=10)

checkpoint_filepath = "/tmp/checkpoint"

model_checkpoint_callback = keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_filepath,
    save_weights_only=True,
    monitor="loss",
    mode="min",
    save_best_only=True,
)

model = get_model(upscale_factor=upscale_factor, channels=1)
model.summary()

callbacks = [ESPCNCallback(), early_stopping_callback, model_checkpoint_callback]
loss_fn = keras.losses.MeanSquaredError()
optimizer = keras.optimizers.Adam(learning_rate=0.001)
Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_1 (InputLayer)        [(None, None, None, 1)]   0         
                                                                 
 conv2d (Conv2D)             (None, None, None, 64)    1664      
                                                                 
 conv2d_1 (Conv2D)           (None, None, None, 64)    36928     
                                                                 
 conv2d_2 (Conv2D)           (None, None, None, 32)    18464     
                                                                 
 conv2d_3 (Conv2D)           (None, None, None, 9)     2601      
                                                                 
 tf.nn.depth_to_space (TFOp  (None, None, None, 1)     0         
 Lambda)                                                         
                                                                 
=================================================================
Total params: 59657 (233.04 KB)
Trainable params: 59657 (233.04 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
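
The parameter counts in the summary can be checked by hand: a Conv2D layer with a k × k kernel has k · k · in_channels · out_channels weights plus one bias per output channel. The short sketch below reproduces the numbers for the four layers. The helper name is ours.

```python
def conv2d_params(kernel_size, in_channels, out_channels):
    """Weights plus biases of a Conv2D layer with a square kernel."""
    return kernel_size * kernel_size * in_channels * out_channels + out_channels


# (kernel size, input channels, output channels) for the four Conv2D layers,
# with channels=1 and upscale_factor=3, so the last layer has 1 * 3**2 = 9 filters.
layers_spec = [(5, 1, 64), (3, 64, 64), (3, 64, 32), (3, 32, 9)]
per_layer = [conv2d_params(*spec) for spec in layers_spec]
total = sum(per_layer)
print(per_layer, total)  # [1664, 36928, 18464, 2601] 59657
```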

6-2-3- Monitoring the fitting¶

Using utility functions defined in the reference notebook, we monitored the fitting process step by step and inspected the model's predictions every 20 epochs.
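
The training log reports a mean PSNR (peak signal-to-noise ratio) per epoch. For pixel values scaled to [0, 1], PSNR follows from the mean squared error as 10 · log10(1 / MSE). The sketch below applies that relation to two validation losses taken from the log. It assumes the images are scaled to [0, 1], and the logged mean PSNR need not match exactly, since it is presumably averaged over individual batches by the callback.

```python
import math


def psnr_from_mse(mse, max_value=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    if mse <= 0:
        raise ValueError("mse must be positive")
    return 10 * math.log10(max_value ** 2 / mse)


first_val_psnr = psnr_from_mse(0.0068)  # val_loss reported for epoch 1
final_val_psnr = psnr_from_mse(0.0022)  # val_loss reported for epoch 100
print(f"{first_val_psnr:.2f} dB -> {final_val_psnr:.2f} dB")
```

Because the relation is logarithmic, the drop in validation loss from 0.0068 to 0.0022 corresponds to a gain of only about 5 dB, which matches the scale of the PSNR values in the log.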

In [ ]:
epochs = 100

model.compile(
    optimizer=optimizer, loss=loss_fn,
)

model.fit(
    train_ds, epochs=epochs, callbacks=callbacks, validation_data=valid_ds, verbose=2
)

# The model weights (that are considered the best) are loaded into the model.
model.load_weights(checkpoint_filepath)
Epoch 1/100
Mean PSNR for epoch: 21.69
1/1 [==============================] - 0s 102ms/step
50/50 - 17s - loss: 0.0298 - val_loss: 0.0068 - 17s/epoch - 344ms/step
Epoch 2/100
Mean PSNR for epoch: 24.64
50/50 - 17s - loss: 0.0051 - val_loss: 0.0033 - 17s/epoch - 348ms/step
Epoch 3/100
Mean PSNR for epoch: 25.54
50/50 - 17s - loss: 0.0036 - val_loss: 0.0029 - 17s/epoch - 334ms/step
Epoch 4/100
Mean PSNR for epoch: 26.23
50/50 - 17s - loss: 0.0031 - val_loss: 0.0027 - 17s/epoch - 332ms/step
Epoch 5/100
Mean PSNR for epoch: 25.97
50/50 - 16s - loss: 0.0030 - val_loss: 0.0026 - 16s/epoch - 328ms/step
Epoch 6/100
Mean PSNR for epoch: 26.20
50/50 - 17s - loss: 0.0029 - val_loss: 0.0025 - 17s/epoch - 332ms/step
Epoch 7/100
Mean PSNR for epoch: 26.24
50/50 - 17s - loss: 0.0028 - val_loss: 0.0025 - 17s/epoch - 334ms/step
Epoch 8/100
Mean PSNR for epoch: 26.18
50/50 - 16s - loss: 0.0028 - val_loss: 0.0025 - 16s/epoch - 329ms/step
Epoch 9/100
Mean PSNR for epoch: 26.35
50/50 - 17s - loss: 0.0029 - val_loss: 0.0025 - 17s/epoch - 336ms/step
Epoch 10/100
Mean PSNR for epoch: 26.20
50/50 - 17s - loss: 0.0028 - val_loss: 0.0024 - 17s/epoch - 345ms/step
Epoch 11/100
Mean PSNR for epoch: 26.32
50/50 - 22s - loss: 0.0027 - val_loss: 0.0024 - 22s/epoch - 448ms/step
Epoch 12/100
Mean PSNR for epoch: 26.05
50/50 - 18s - loss: 0.0027 - val_loss: 0.0024 - 18s/epoch - 358ms/step
Epoch 13/100
Mean PSNR for epoch: 26.21
50/50 - 18s - loss: 0.0028 - val_loss: 0.0024 - 18s/epoch - 359ms/step
Epoch 14/100
Mean PSNR for epoch: 26.40
50/50 - 18s - loss: 0.0027 - val_loss: 0.0024 - 18s/epoch - 356ms/step
Epoch 15/100
Mean PSNR for epoch: 26.47
50/50 - 18s - loss: 0.0027 - val_loss: 0.0024 - 18s/epoch - 357ms/step
Epoch 16/100
Mean PSNR for epoch: 26.41
50/50 - 18s - loss: 0.0027 - val_loss: 0.0024 - 18s/epoch - 352ms/step
Epoch 17/100
Mean PSNR for epoch: 25.82
50/50 - 18s - loss: 0.0029 - val_loss: 0.0024 - 18s/epoch - 353ms/step
Epoch 18/100
Mean PSNR for epoch: 26.41
50/50 - 18s - loss: 0.0027 - val_loss: 0.0024 - 18s/epoch - 351ms/step
Epoch 19/100
Mean PSNR for epoch: 26.77
50/50 - 18s - loss: 0.0027 - val_loss: 0.0024 - 18s/epoch - 352ms/step
Epoch 20/100
Mean PSNR for epoch: 26.58
50/50 - 23s - loss: 0.0027 - val_loss: 0.0024 - 23s/epoch - 467ms/step
Epoch 21/100
Mean PSNR for epoch: 26.52
1/1 [==============================] - 0s 57ms/step
50/50 - 22s - loss: 0.0026 - val_loss: 0.0024 - 22s/epoch - 434ms/step
Epoch 22/100
Mean PSNR for epoch: 26.22
50/50 - 23s - loss: 0.0026 - val_loss: 0.0023 - 23s/epoch - 460ms/step
Epoch 23/100
Mean PSNR for epoch: 26.60
50/50 - 21s - loss: 0.0027 - val_loss: 0.0023 - 21s/epoch - 422ms/step
Epoch 24/100
Mean PSNR for epoch: 26.56
50/50 - 18s - loss: 0.0027 - val_loss: 0.0024 - 18s/epoch - 353ms/step
Epoch 25/100
Mean PSNR for epoch: 27.01
50/50 - 18s - loss: 0.0026 - val_loss: 0.0023 - 18s/epoch - 353ms/step
Epoch 26/100
Mean PSNR for epoch: 26.30
50/50 - 18s - loss: 0.0026 - val_loss: 0.0023 - 18s/epoch - 353ms/step
Epoch 27/100
Mean PSNR for epoch: 27.18
50/50 - 18s - loss: 0.0026 - val_loss: 0.0023 - 18s/epoch - 353ms/step
Epoch 28/100
Mean PSNR for epoch: 26.40
50/50 - 18s - loss: 0.0026 - val_loss: 0.0024 - 18s/epoch - 352ms/step
Epoch 29/100
Mean PSNR for epoch: 26.63
50/50 - 18s - loss: 0.0026 - val_loss: 0.0023 - 18s/epoch - 352ms/step
Epoch 30/100
Mean PSNR for epoch: 26.43
50/50 - 18s - loss: 0.0026 - val_loss: 0.0023 - 18s/epoch - 353ms/step
Epoch 31/100
Mean PSNR for epoch: 26.01
50/50 - 17s - loss: 0.0028 - val_loss: 0.0024 - 17s/epoch - 346ms/step
Epoch 32/100
Mean PSNR for epoch: 26.44
50/50 - 17s - loss: 0.0027 - val_loss: 0.0023 - 17s/epoch - 349ms/step
Epoch 33/100
Mean PSNR for epoch: 26.89
50/50 - 17s - loss: 0.0027 - val_loss: 0.0023 - 17s/epoch - 349ms/step
Epoch 34/100
Mean PSNR for epoch: 26.50
50/50 - 17s - loss: 0.0026 - val_loss: 0.0023 - 17s/epoch - 348ms/step
Epoch 35/100
Mean PSNR for epoch: 26.69
50/50 - 18s - loss: 0.0026 - val_loss: 0.0023 - 18s/epoch - 351ms/step
Epoch 36/100
Mean PSNR for epoch: 26.83
50/50 - 17s - loss: 0.0026 - val_loss: 0.0023 - 17s/epoch - 346ms/step
Epoch 37/100
Mean PSNR for epoch: 26.58
50/50 - 17s - loss: 0.0026 - val_loss: 0.0023 - 17s/epoch - 346ms/step
Epoch 38/100
Mean PSNR for epoch: 26.76
50/50 - 17s - loss: 0.0026 - val_loss: 0.0023 - 17s/epoch - 347ms/step
Epoch 39/100
Mean PSNR for epoch: 26.65
50/50 - 17s - loss: 0.0026 - val_loss: 0.0023 - 17s/epoch - 347ms/step
Epoch 40/100
Mean PSNR for epoch: 26.40
50/50 - 18s - loss: 0.0026 - val_loss: 0.0023 - 18s/epoch - 357ms/step
Epoch 41/100
Mean PSNR for epoch: 26.45
1/1 [==============================] - 0s 47ms/step
50/50 - 18s - loss: 0.0026 - val_loss: 0.0023 - 18s/epoch - 364ms/step
Epoch 42/100
Mean PSNR for epoch: 26.68
50/50 - 19s - loss: 0.0026 - val_loss: 0.0023 - 19s/epoch - 381ms/step
Epoch 43/100
Mean PSNR for epoch: 26.80
50/50 - 17s - loss: 0.0026 - val_loss: 0.0023 - 17s/epoch - 346ms/step
Epoch 44/100
Mean PSNR for epoch: 26.84
50/50 - 17s - loss: 0.0026 - val_loss: 0.0023 - 17s/epoch - 349ms/step
Epoch 45/100
Mean PSNR for epoch: 26.48
50/50 - 17s - loss: 0.0025 - val_loss: 0.0023 - 17s/epoch - 349ms/step
Epoch 46/100
Mean PSNR for epoch: 26.26
50/50 - 17s - loss: 0.0025 - val_loss: 0.0023 - 17s/epoch - 349ms/step
Epoch 47/100
Mean PSNR for epoch: 26.58
50/50 - 18s - loss: 0.0025 - val_loss: 0.0023 - 18s/epoch - 350ms/step
Epoch 48/100
Mean PSNR for epoch: 26.32
50/50 - 17s - loss: 0.0027 - val_loss: 0.0023 - 17s/epoch - 346ms/step
Epoch 49/100
Mean PSNR for epoch: 26.54
50/50 - 17s - loss: 0.0025 - val_loss: 0.0023 - 17s/epoch - 346ms/step
Epoch 50/100
Mean PSNR for epoch: 26.42
50/50 - 17s - loss: 0.0025 - val_loss: 0.0023 - 17s/epoch - 347ms/step
Epoch 51/100
Mean PSNR for epoch: 26.67
50/50 - 17s - loss: 0.0025 - val_loss: 0.0023 - 17s/epoch - 347ms/step
Epoch 52/100
Mean PSNR for epoch: 26.45
50/50 - 17s - loss: 0.0025 - val_loss: 0.0023 - 17s/epoch - 348ms/step
Epoch 53/100
Mean PSNR for epoch: 26.91
50/50 - 17s - loss: 0.0025 - val_loss: 0.0023 - 17s/epoch - 345ms/step
Epoch 54/100
Mean PSNR for epoch: 26.56
50/50 - 17s - loss: 0.0025 - val_loss: 0.0023 - 17s/epoch - 346ms/step
Epoch 55/100
Mean PSNR for epoch: 26.91
50/50 - 17s - loss: 0.0025 - val_loss: 0.0022 - 17s/epoch - 345ms/step
Epoch 56/100
Mean PSNR for epoch: 26.81
50/50 - 17s - loss: 0.0025 - val_loss: 0.0022 - 17s/epoch - 347ms/step
Epoch 57/100
Mean PSNR for epoch: 26.70
50/50 - 17s - loss: 0.0025 - val_loss: 0.0023 - 17s/epoch - 349ms/step
Epoch 58/100
Mean PSNR for epoch: 26.45
50/50 - 18s - loss: 0.0025 - val_loss: 0.0023 - 18s/epoch - 354ms/step
Epoch 59/100
Mean PSNR for epoch: 26.82
50/50 - 17s - loss: 0.0026 - val_loss: 0.0023 - 17s/epoch - 350ms/step
Epoch 60/100
Mean PSNR for epoch: 26.45
50/50 - 17s - loss: 0.0027 - val_loss: 0.0022 - 17s/epoch - 347ms/step
Epoch 61/100
Mean PSNR for epoch: 26.75
1/1 [==============================] - 0s 46ms/step
50/50 - 18s - loss: 0.0025 - val_loss: 0.0022 - 18s/epoch - 368ms/step
Epoch 62/100
Mean PSNR for epoch: 26.27
50/50 - 19s - loss: 0.0025 - val_loss: 0.0022 - 19s/epoch - 384ms/step
Epoch 63/100
Mean PSNR for epoch: 26.37
50/50 - 17s - loss: 0.0025 - val_loss: 0.0022 - 17s/epoch - 347ms/step
Epoch 64/100
Mean PSNR for epoch: 27.33
50/50 - 17s - loss: 0.0025 - val_loss: 0.0022 - 17s/epoch - 349ms/step
Epoch 65/100
Mean PSNR for epoch: 26.89
50/50 - 17s - loss: 0.0025 - val_loss: 0.0022 - 17s/epoch - 344ms/step
Epoch 66/100
Mean PSNR for epoch: 25.87
50/50 - 17s - loss: 0.0027 - val_loss: 0.0026 - 17s/epoch - 346ms/step
Epoch 67/100
Mean PSNR for epoch: 26.67
50/50 - 18s - loss: 0.0026 - val_loss: 0.0022 - 18s/epoch - 356ms/step
Epoch 68/100
Mean PSNR for epoch: 26.70
50/50 - 17s - loss: 0.0025 - val_loss: 0.0022 - 17s/epoch - 345ms/step
Epoch 69/100
Mean PSNR for epoch: 26.46
50/50 - 17s - loss: 0.0025 - val_loss: 0.0022 - 17s/epoch - 348ms/step
Epoch 70/100
Mean PSNR for epoch: 26.71
50/50 - 17s - loss: 0.0025 - val_loss: 0.0022 - 17s/epoch - 346ms/step
Epoch 71/100
Mean PSNR for epoch: 26.62
50/50 - 17s - loss: 0.0025 - val_loss: 0.0023 - 17s/epoch - 343ms/step
Epoch 72/100
Mean PSNR for epoch: 26.90
50/50 - 17s - loss: 0.0025 - val_loss: 0.0022 - 17s/epoch - 345ms/step
Epoch 73/100
Mean PSNR for epoch: 26.73
50/50 - 17s - loss: 0.0025 - val_loss: 0.0022 - 17s/epoch - 348ms/step
Epoch 74/100
Mean PSNR for epoch: 26.54
50/50 - 18s - loss: 0.0025 - val_loss: 0.0022 - 18s/epoch - 354ms/step
Epoch 75/100
Mean PSNR for epoch: 26.88
50/50 - 18s - loss: 0.0025 - val_loss: 0.0022 - 18s/epoch - 351ms/step
Epoch 76/100
Mean PSNR for epoch: 26.55
50/50 - 18s - loss: 0.0025 - val_loss: 0.0022 - 18s/epoch - 351ms/step
Epoch 77/100
Mean PSNR for epoch: 24.51
50/50 - 17s - loss: 0.0028 - val_loss: 0.0036 - 17s/epoch - 344ms/step
Epoch 78/100
Mean PSNR for epoch: 26.89
50/50 - 17s - loss: 0.0029 - val_loss: 0.0022 - 17s/epoch - 346ms/step
Epoch 79/100
Mean PSNR for epoch: 26.93
50/50 - 17s - loss: 0.0025 - val_loss: 0.0022 - 17s/epoch - 347ms/step
Epoch 80/100
Mean PSNR for epoch: 27.01
50/50 - 17s - loss: 0.0025 - val_loss: 0.0022 - 17s/epoch - 345ms/step
Epoch 81/100
Mean PSNR for epoch: 26.90
1/1 [==============================] - 0s 47ms/step
50/50 - 18s - loss: 0.0025 - val_loss: 0.0022 - 18s/epoch - 367ms/step
Epoch 82/100
Mean PSNR for epoch: 26.66
50/50 - 19s - loss: 0.0025 - val_loss: 0.0022 - 19s/epoch - 382ms/step
Epoch 83/100
Mean PSNR for epoch: 26.85
50/50 - 17s - loss: 0.0025 - val_loss: 0.0022 - 17s/epoch - 342ms/step
Epoch 84/100
Mean PSNR for epoch: 26.70
50/50 - 17s - loss: 0.0025 - val_loss: 0.0022 - 17s/epoch - 345ms/step
Epoch 85/100
Mean PSNR for epoch: 26.85
50/50 - 17s - loss: 0.0025 - val_loss: 0.0022 - 17s/epoch - 344ms/step
Epoch 86/100
Mean PSNR for epoch: 26.18
50/50 - 17s - loss: 0.0025 - val_loss: 0.0022 - 17s/epoch - 345ms/step
Epoch 87/100
Mean PSNR for epoch: 26.51
50/50 - 17s - loss: 0.0025 - val_loss: 0.0022 - 17s/epoch - 345ms/step
Epoch 88/100
Mean PSNR for epoch: 26.40
50/50 - 17s - loss: 0.0025 - val_loss: 0.0022 - 17s/epoch - 348ms/step
Epoch 89/100
Mean PSNR for epoch: 26.58
50/50 - 17s - loss: 0.0025 - val_loss: 0.0022 - 17s/epoch - 341ms/step
Epoch 90/100
Mean PSNR for epoch: 26.39
50/50 - 17s - loss: 0.0025 - val_loss: 0.0022 - 17s/epoch - 348ms/step
Epoch 91/100
Mean PSNR for epoch: 26.36
50/50 - 17s - loss: 0.0025 - val_loss: 0.0022 - 17s/epoch - 347ms/step
Epoch 92/100
Mean PSNR for epoch: 26.84
50/50 - 17s - loss: 0.0025 - val_loss: 0.0022 - 17s/epoch - 350ms/step
Epoch 93/100
Mean PSNR for epoch: 26.81
50/50 - 18s - loss: 0.0025 - val_loss: 0.0023 - 18s/epoch - 351ms/step
Epoch 94/100
Mean PSNR for epoch: 26.65
50/50 - 17s - loss: 0.0025 - val_loss: 0.0022 - 17s/epoch - 344ms/step
Epoch 95/100
Mean PSNR for epoch: 26.09
50/50 - 17s - loss: 0.0025 - val_loss: 0.0024 - 17s/epoch - 343ms/step
Epoch 96/100
Mean PSNR for epoch: 26.47
50/50 - 17s - loss: 0.0027 - val_loss: 0.0022 - 17s/epoch - 345ms/step
Epoch 97/100
Mean PSNR for epoch: 26.39
50/50 - 17s - loss: 0.0025 - val_loss: 0.0022 - 17s/epoch - 346ms/step
Epoch 98/100
Mean PSNR for epoch: 26.33
50/50 - 17s - loss: 0.0025 - val_loss: 0.0022 - 17s/epoch - 343ms/step
Epoch 99/100
Mean PSNR for epoch: 26.58
50/50 - 17s - loss: 0.0025 - val_loss: 0.0022 - 17s/epoch - 344ms/step
Epoch 100/100
Mean PSNR for epoch: 27.09
50/50 - 17s - loss: 0.0025 - val_loss: 0.0022 - 17s/epoch - 345ms/step
<tensorflow.python.checkpoint.checkpoint.CheckpointLoadStatus at 0x16a616e90>

6-3- Image Super-Resolution using an Efficient Sub-Pixel CNN Fine Tuning¶

To fine-tune the model trained in the "Image Enhancement" notebook, we chose the following dataset from Kaggle:
https://www.kaggle.com/datasets/sanidhyak/human-face-emotions

Despite our efforts, which are shown below, we were not able to feed pictures to the pre-trained model because it required a specific input format. Even when we applied the model's own pre-processing method, the model would not accept the pictures.

In the end, we moved on and decided to use the pre-trained .h5 model directly as the image enhancement method.

The main related notebook can be found here: LINK

A brief explanation of the work done for this part follows:

6-3-1- Creating Low Resolution Images¶

Following the approach used for our main super-resolution model, we created low-resolution inputs by downscaling each image by a factor of 0.5. Because the dataset contains JPEG, PNG, and GIF files, the script handles all three formats.

In [ ]:
import os

from PIL import Image

input_dir = "/Users/kavian/Desktop/data/high_resolution_images"
output_dir = "/Users/kavian/Desktop/data/low_resolution_images"
scale_factor = 0.5  # Fraction of the original width and height to keep

# Create the output folder if it does not exist yet
os.makedirs(output_dir, exist_ok=True)

image_extensions = (".jpg", ".png", ".gif")
for filename in os.listdir(input_dir):
    # Match extensions case-insensitively so files such as "IMG.JPG" are included
    if filename.lower().endswith(image_extensions):
        img = Image.open(os.path.join(input_dir, filename))
        new_size = (int(img.width * scale_factor), int(img.height * scale_factor))
        low_res_img = img.resize(new_size, Image.LANCZOS)
        low_res_img.save(os.path.join(output_dir, filename))

6-3-2- Importing the model¶

After splitting the dataset into training, validation, and test sets, we imported the model that was trained in the previous step. We left all of its layers trainable to obtain a better fine-tuned model. This was feasible because the model is small and shallow, so all of its layers can be retrained even on a CPU.

In [ ]:
model = tf.keras.models.load_model('/Users/kavian/Desktop/GBC/Second Semester/6- DL II Math/Final Project/DLIIMathProject/Notebooks/Image Enhancement/Image_Enhancement_before_finetuning.h5')
In [ ]:
print(model.summary())
Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_1 (InputLayer)        [(None, None, None, 1)]   0         
                                                                 
 conv2d (Conv2D)             (None, None, None, 64)    1664      
                                                                 
 conv2d_1 (Conv2D)           (None, None, None, 64)    36928     
                                                                 
 conv2d_2 (Conv2D)           (None, None, None, 32)    18464     
                                                                 
 conv2d_3 (Conv2D)           (None, None, None, 9)     2601      
                                                                 
 tf.nn.depth_to_space (TFOp  (None, None, None, 1)     0         
 Lambda)                                                         
                                                                 
=================================================================
Total params: 59657 (233.04 KB)
Trainable params: 59657 (233.04 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
None
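The parameter counts in the summary above can be checked by hand, which also hints at the upscale factor the model was built for. The sketch below assumes square kernels of size 5, 3, 3, and 3; these sizes are not printed in the summary and are inferred here only because they reproduce the reported counts. The upscale factor is likewise an inference from the 9-channel final convolution feeding `depth_to_space`.

```python
def conv2d_params(kernel_size, in_channels, out_channels):
    """Weights plus one bias per output channel for a square Conv2D kernel."""
    return kernel_size * kernel_size * in_channels * out_channels + out_channels


# (kernel_size, in_channels, out_channels); kernel sizes are inferred, not stated
layers = [
    (5, 1, 64),   # conv2d
    (3, 64, 64),  # conv2d_1
    (3, 64, 32),  # conv2d_2
    (3, 32, 9),   # conv2d_3
]

param_counts = [conv2d_params(*layer) for layer in layers]
total_params = sum(param_counts)

# depth_to_space folds r*r channels into one channel that is r times larger
# per side, so going from 9 channels to 1 channel implies r = 3.
last_conv_channels = layers[-1][2]
upscale_factor = int(round(last_conv_channels ** 0.5))

print(param_counts)    # [1664, 36928, 18464, 2601]
print(total_params)    # 59657
print(upscale_factor)  # 3
```

If this reading is right, the model enlarges its input by a factor of 3 per side, so low-resolution images created with a scale factor of 0.5 would be upscaled to 1.5 times the size of the originals. A scale factor of 1/3 may be a better match for fine-tuning targets.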

6-3-3- Fitting and Error¶

Finally, we tried to fit the model on our new dataset, but we repeatedly encountered the following error:

In [ ]:
# Train the model on the new dataset
model.fit(
    train_generator,
    steps_per_epoch=train_generator.samples // batch_size,
    validation_data=validation_generator,
    validation_steps=validation_generator.samples // batch_size,
    epochs=num_epochs
)
Epoch 1/10
---------------------------------------------------------------------------

UnimplementedError                        Traceback (most recent call last)

Cell In[16], line 2

      1 # Train the model on the new dataset

----> 2 model.fit(

      3     train_generator,

      4     steps_per_epoch=train_generator.samples // batch_size,

      5     validation_data=validation_generator,

      6     validation_steps=validation_generator.samples // batch_size,

      7     epochs=num_epochs

      8 )



File ~/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py:70, in filter_traceback.<locals>.error_handler(*args, **kwargs)

     67     filtered_tb = _process_traceback_frames(e.__traceback__)

     68     # To get the full stack trace, call:

     69     # `tf.debugging.disable_traceback_filtering()`

---> 70     raise e.with_traceback(filtered_tb) from None

     71 finally:

     72     del filtered_tb



File ~/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/tensorflow/python/eager/execute.py:53, in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)

     51 try:

     52   ctx.ensure_initialized()

---> 53   tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,

     54                                       inputs, attrs, num_outputs)

     55 except core._NotOkStatusException as e:

     56   if name is not None:



UnimplementedError: Graph execution error:



Detected at node 'model/conv2d/Relu' defined at (most recent call last):

    File "<frozen runpy>", line 198, in _run_module_as_main

    File "<frozen runpy>", line 88, in _run_code

    File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/ipykernel_launcher.py", line 17, in <module>

      app.launch_new_instance()

    File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/traitlets/config/application.py", line 1043, in launch_instance

      app.start()

    File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/ipykernel/kernelapp.py", line 725, in start

      self.io_loop.start()

    File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/tornado/platform/asyncio.py", line 195, in start

      self.asyncio_loop.run_forever()

    File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/asyncio/base_events.py", line 604, in run_forever

      self._run_once()

    File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/asyncio/base_events.py", line 1909, in _run_once

      handle._run()

    File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/asyncio/events.py", line 80, in _run

      self._context.run(self._callback, *self._args)

    File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/ipykernel/kernelbase.py", line 513, in dispatch_queue

      await self.process_one()

    File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/ipykernel/kernelbase.py", line 502, in process_one

      await dispatch(*args)

    File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/ipykernel/kernelbase.py", line 409, in dispatch_shell

      await result

    File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/ipykernel/kernelbase.py", line 729, in execute_request

      reply_content = await reply_content

    File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/ipykernel/ipkernel.py", line 422, in do_execute

      res = shell.run_cell(

    File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/ipykernel/zmqshell.py", line 540, in run_cell

      return super().run_cell(*args, **kwargs)

    File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/IPython/core/interactiveshell.py", line 3009, in run_cell

      result = self._run_cell(

    File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/IPython/core/interactiveshell.py", line 3064, in _run_cell

      result = runner(coro)

    File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/IPython/core/async_helpers.py", line 129, in _pseudo_sync_runner

      coro.send(None)

    File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/IPython/core/interactiveshell.py", line 3269, in run_cell_async

      has_raised = await self.run_ast_nodes(code_ast.body, cell_name,

    File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/IPython/core/interactiveshell.py", line 3448, in run_ast_nodes

      if await self.run_code(code, result, async_=asy):

    File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/IPython/core/interactiveshell.py", line 3508, in run_code

      exec(code_obj, self.user_global_ns, self.user_ns)

    File "/var/folders/qy/q7v66x7544q5l2_cnqm0_8y00000gn/T/ipykernel_6556/1510933377.py", line 2, in <module>

      model.fit(

    File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler

      return fn(*args, **kwargs)

    File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/keras/src/engine/training.py", line 1742, in fit

      tmp_logs = self.train_function(iterator)

    File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/keras/src/engine/training.py", line 1338, in train_function

      return step_function(self, iterator)

    File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/keras/src/engine/training.py", line 1322, in step_function

      outputs = model.distribute_strategy.run(run_step, args=(data,))

    File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/keras/src/engine/training.py", line 1303, in run_step

      outputs = model.train_step(data)

    File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/keras/src/engine/training.py", line 1080, in train_step

      y_pred = self(x, training=True)

    File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler

      return fn(*args, **kwargs)

    File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/keras/src/engine/training.py", line 569, in __call__

      return super().__call__(*args, **kwargs)

    File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler

      return fn(*args, **kwargs)

    File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/keras/src/engine/base_layer.py", line 1150, in __call__

      outputs = call_fn(inputs, *args, **kwargs)

    File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 96, in error_handler

      return fn(*args, **kwargs)

    File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/keras/src/engine/functional.py", line 512, in call

      return self._run_internal_graph(inputs, training=training, mask=mask)

    File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/keras/src/engine/functional.py", line 669, in _run_internal_graph

      outputs = node.layer(*args, **kwargs)

    File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler

      return fn(*args, **kwargs)

    File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/keras/src/engine/base_layer.py", line 1150, in __call__

      outputs = call_fn(inputs, *args, **kwargs)

    File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 96, in error_handler

      return fn(*args, **kwargs)

    File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/keras/src/layers/convolutional/base_conv.py", line 321, in call

      return self.activation(outputs)

    File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/keras/src/activations.py", line 321, in relu

      return backend.relu(

    File "/Users/kavian/Desktop/venv/venv/tensorflow_cpu/lib/python3.11/site-packages/keras/src/backend.py", line 5397, in relu

      x = tf.nn.relu(x)

Node: 'model/conv2d/Relu'

Fused conv implementation does not support grouped convolutions for now.

	 [[{{node model/conv2d/Relu}}]] [Op:__inference_train_function_1425]

This error occurred on every attempt. It appears to come from the number of channels in the images fed to the model: the model summary above shows an input shape of (None, None, None, 1), so the model expects single-channel (grayscale) input. TensorFlow likely interprets a three-channel image passed to a one-channel kernel as a grouped convolution, which would explain the wording of the message.
We searched for this problem and found the following discussions:
https://stackoverflow.com/questions/61796021/unimplementederror-fused-conv-implementation-does-not-support-grouped-convoluti
and
https://stackoverflow.com/questions/73130599/tensorflow-fused-conv-implementation-does-not-support-grouped-convolutions

However, they did not solve our problem.

We suspect that the problem is related to the RGB-to-YUV conversion. The main model was trained with this conversion, but we were not able to reproduce it on our own dataset. The model therefore probably needs to receive only the Y channel, while we were feeding it RGB images.

We plan to resolve this problem in future work.
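One possible way to test this hypothesis is to convert every batch to its Y (luminance) channel before it reaches the model. The sketch below is not the pre-processing used in our notebooks; it is a minimal NumPy illustration under several assumptions: images are RGB arrays scaled to [0, 1], the data source yields (low-resolution, high-resolution) pairs, and the helper names `rgb_to_y` and `y_channel_batches` are hypothetical. The 3x size ratio in the demonstration is also an assumption, based on the 9-channel final convolution in the model summary.

```python
import numpy as np

# BT.601 luma weights, the ones a standard RGB-to-YUV conversion uses for Y
LUMA_WEIGHTS = np.array([0.299, 0.587, 0.114], dtype=np.float32)


def rgb_to_y(rgb_batch):
    """Convert (batch, height, width, 3) RGB in [0, 1] to (batch, height, width, 1) luminance."""
    rgb_batch = np.asarray(rgb_batch, dtype=np.float32)
    y = rgb_batch @ LUMA_WEIGHTS   # weighted sum over the channel axis
    return y[..., np.newaxis]      # restore a single channel axis


def y_channel_batches(rgb_pair_batches):
    """Wrap a hypothetical iterator of (low_res_rgb, high_res_rgb) batches."""
    for low_res_rgb, high_res_rgb in rgb_pair_batches:
        yield rgb_to_y(low_res_rgb), rgb_to_y(high_res_rgb)


# Tiny demonstration with random data standing in for real images
rng = np.random.default_rng(0)
fake_batches = [(rng.random((2, 8, 8, 3)), rng.random((2, 24, 24, 3)))]
low_res_y, high_res_y = next(y_channel_batches(fake_batches))
print(low_res_y.shape, high_res_y.shape)  # (2, 8, 8, 1) (2, 24, 24, 1)
```

In a TensorFlow pipeline, `tf.image.rgb_to_yuv` followed by selecting the first channel should give the same Y values. Whether this removes the error in our setup remains untested.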

6-4- Streamlit App¶

image.png

Our Streamlit app can be found at the following location: https://testrepo-y12d7zyos0a.streamlit.app/

The repo for the app is at the following address: https://github.com/dusanBirta/test_repo

The Streamlit app works as follows. The user is prompted to upload a photo of a person to animate. The photo is passed to our YOLOv8 model, which detects the face in the image, and the app crops the image to the detected bounding box. The cropped face is then passed to the animation model, which uses the First Order Motion Model to animate the 2D image. It takes a driving video that contains facial motion and generates an animation in which the face from the 2D image follows the same motion. The code for the app can be found here: https://github.com/dusanBirta/test_repo/blob/main/app.py
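The cropping step described above can be sketched as follows. This is an illustration rather than the code in app.py: the helper name `crop_face` is hypothetical, the image is assumed to be a NumPy array of shape (height, width, channels), and the bounding box is assumed to be given as (x1, y1, x2, y2) pixel coordinates.

```python
import numpy as np


def crop_face(image, box):
    """Crop a (height, width, channels) image to an (x1, y1, x2, y2) pixel box."""
    height, width = image.shape[:2]
    x1, y1, x2, y2 = (int(round(v)) for v in box)
    # Clamp so a box that spills past the border still yields a valid crop
    x1, x2 = max(0, x1), min(width, x2)
    y1, y2 = max(0, y1), min(height, y2)
    if x2 <= x1 or y2 <= y1:
        raise ValueError("bounding box does not overlap the image")
    return image[y1:y2, x1:x2]


# A blank 200x100 photo and a box that extends past the right edge
photo = np.zeros((100, 200, 3), dtype=np.uint8)
face = crop_face(photo, (150.4, 20.0, 230.0, 80.6))
print(face.shape)  # (61, 50, 3)
```

Clamping the coordinates guards against detections that extend slightly beyond the image border, which would otherwise produce an empty or misaligned crop.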

Several issues arose when implementing the Streamlit app.

Issue 1: Could not implement all the models into the app.

  • At first I attempted to integrate every notebook into the Streamlit app. There were two ways to make the required files available in the main branch of the app's GitHub repository: cloning the repositories we used, such as first-order-motion or Real-ESRGAN, or copying only the files needed to run those models. I took the second approach. Unfortunately, integrating Real-ESRGAN image and video enhancement into Streamlit was not possible. The problem was running the 'setup.py' file of the Real-ESRGAN repository. I tried creating bash scripts to run the file and calling '! python -m setup.py develop' from the app.py file, but none of these worked. So I moved on to implementing the First Order Motion model.

Issue 2: CUDA not available on Streamlit Cloud.

  • The second issue involved Streamlit Cloud not providing GPU access. The First Order Motion model was coded to run on the GPU to speed up computation, but I had difficulty finding where this was set. After several long attempts, I tracked the setting down to the file 'demo.py', and the fix turned out to be simple: change 'cpu=False' to 'cpu=True' for the methods that were calling the GPU. Below is a snippet of the solution. image.png
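Instead of hard-coding the flag, the device could be chosen at runtime. The sketch below is an assumption-laden illustration, not the change made in the app: `should_run_on_cpu` is a hypothetical helper, and the commented calls to `load_checkpoints` and `make_animation` reflect our recollection of the First Order Motion 'demo.py' and should be checked against the actual file.

```python
def should_run_on_cpu():
    """Return True when no CUDA device is usable, or when PyTorch is not installed."""
    try:
        import torch
    except ImportError:
        return True
    return not torch.cuda.is_available()


use_cpu = should_run_on_cpu()
print(f"cpu={use_cpu}")

# Hypothetical usage with the First Order Motion demo helpers (not executed here):
# generator, kp_detector = load_checkpoints(config_path, checkpoint_path, cpu=use_cpu)
# predictions = make_animation(source_image, driving_video, generator, kp_detector, cpu=use_cpu)
```

With this approach the same code would run on a GPU machine during development and fall back to the CPU on Streamlit Cloud.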

Issue 3: Streamlit crashing after full implementation.

  • The major issue was Streamlit crashing after the full implementation. By full implementation I mean: printing the YOLOv8 prediction with its bounding box and confidence score to the screen, handling pictures that contain multiple people, extracting and displaying the YOLOv8 bounding box for each person, letting the user choose from a dropdown menu which person to animate, and finally generating and displaying the animation. I believe it kept crashing because of resource limits on Streamlit, since the log in the Streamlit terminal showed that the crash happened while generating the animation. The solution was to scale down the app: instead of offering full functionality, it displays the cropped image and then generates the animation on it. This is the version of the app currently live on Streamlit. Below is a snippet of the code for the full implementation. image-2.png

7- Conclusion¶

In conclusion, our project successfully addresses the challenge of transforming 2D images into captivating 3D representations through an automated and user-friendly approach. By combining YOLOv8 for face detection, a CNN model for image enhancement, and the First Order Motion Model for Image Animation, we have achieved impressive results in generating realistic 3D images and clips from 2D input.

Our innovative algorithm enables a wide range of users to create compelling 3D content without the need for labor-intensive manual processes or specialized software. With the democratization of 3D content creation, our solution holds promising potential across various industries, revolutionizing the way we interact with and experience visual media.

We also discovered that Streamlit Cloud is not always the best choice for deploying machine learning models, particularly when they are complex and require substantial resources. For these kinds of applications and models, it may be better to use other options such as Gradio.

This project contributes to the advancement of computer vision and deep learning applications and opens up new possibilities for immersive storytelling and visualization.